-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF/REF: Check use of numexpr earlier in the DataFrame operation #41122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF/REF: Check use of numexpr earlier in the DataFrame operation #41122
Conversation
---------- | ||
op : operator | ||
size : int | ||
dtypes : list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list[DtypeObj]?
else: | ||
use_numexpr = expressions.can_use_numexpr(func, None, None, right) | ||
|
||
array_op = ops.get_array_op(func, use_numexpr=use_numexpr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the get_array_op on L6832 still needed?
@@ -6866,6 +6886,17 @@ def _dispatch_frame_op(self, right, func: Callable, axis: int | None = None): | |||
# maybe_align_as_frame ensures we do not have an ndarray here | |||
assert not isinstance(right, np.ndarray) | |||
|
|||
if isinstance(self._mgr, ArrayManager): | |||
use_numexpr = expressions.can_use_numexpr( | |||
func, self.shape[1], (right.dtype,), None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does self.shape[1] correspond to the size of some array?
else: | ||
use_numexpr = expressions.can_use_numexpr( | ||
func, None, (right.dtype,), None | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these ~8 lines (x4) seem like a reasonable cope for an AM/BM method
@@ -135,7 +135,7 @@ def _masked_arith_op(x: np.ndarray, y, op): | |||
return result | |||
|
|||
|
|||
def _na_arithmetic_op(left, right, op, is_cmp: bool = False): | |||
def _na_arithmetic_op(left, right, op, is_cmp: bool = False, use_numexpr=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
annotate, docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can use_numexpr be keyword-only
any estimates for perf impact? |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
#41161 and #40463 have been merged. can you address @jbrockmendel comments. |
Parameters | ||
---------- | ||
op : operator | ||
size : int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int or None?
def _can_use_numexpr(op, op_str, a, b, dtype_check): | ||
""" return a boolean if we WILL be using numexpr """ | ||
if op_str is not None: | ||
|
||
# required min elements (otherwise we are adding overhead) | ||
if a.size > _MIN_ELEMENTS: | ||
if isinstance(b, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the place where this is moved from has a comment # can never use numexpr
. not a huge loss since the comment isn't that informative, but at some point we should get a helpful comment about why it cant be used
Thanks for the PR but appears to have gone stale. Closing to clear the queue, but feel free to reopen if you find the time to revisit this PR and the comments. |
This PR tries to gather some checks about whether numexpr can be used or not which can be done upfront on the DataFrame level, before dispatching to the array_ops.
For example, for a scalar right operand, we can check it only once if it's compatible with numexpr usage. Or for ArrayManager, we can check the size constraint once upfront (since each array has the same size). For those cases, this avoids overhead in
expressions.evaluate
.To this end, I created a new
can_use_numexpr
that starts to gather some checks (there might be more that can be added), call this up front inDataFrame._dispatch_frame_op
(this could IMO be bit cleaner with #39772, but the if/else checks inside_dispatch_frame_op
work as well). And then the result ofcan_use_numexpr
is passed through to the array_ops.